Word Boundary Decision with CRF for Chinese Word Segmentation
نویسندگان
چکیده
Chinese word segmentation systems necessarily perform both accurately and quickly for real applications. In this paper, we study on word boundary decision (WBD) approach for Chinese word segmentation and implement it as a 2-tag character tagging with conditional random filed (CRF). With a help of tag transition features, WBD with CRF segmentation approach can achieve comparative performances compared to 4-tag character tagging approach (represents the state-of-the-art segmentation approach). But it requires only about half training time and memory space as much as 4-tag character tagging approach. These results encourage that WBD segmentation approach is a good choice for real Chinese word segmentation systems.
منابع مشابه
Word segmentation in Persian continuous speech using F0 contour
Word segmentation in continuous speech is a complex cognitive process. Previous research on spoken word segmentation has revealed that in fixed-stress languages, listeners use acoustic cues to stress to de-segment speech into words. It has been further assumed that stress in non-final or non-initial position hinders the demarcative function of this prosodic factor. In Persian, stress is retract...
متن کاملA MMSM-based Hybrid Method for Chinese MicroBlog Word Segmentation
After years of researches, Chinese word segmentation has achieved quite high precisions for formal style text. However, the performance of segmentation is not so satisfying for MicroBlog corpora. In this paper we describe a scheme for Chinese word segmentation for, MicroBlog which integrates the characterbased and word-based information in the directed graph generated by MMSM model. Word-level ...
متن کاملChinese Word Segmentation based on Mixing Multiple Preprocessor and CRF
This paper describes the Chinese Word Segmenter for our participation in CIPSSIGHAN-2010 bake-off task of Chinese word segmentation. We formalize the tasks as sequence tagging problems, and implemented them using conditional random fields (CRFs) model. The system contains two modules: multiple preprocessor and basic segmenter. The basic segmenter is designed as a problem of character-based tagg...
متن کاملHMM Revises Low Marginal Probability by CRF for Chinese Word Segmentation
This paper presents a Chinese word segmentation system for CIPS-SIGHAN 2010 Chinese language processing task. Firstly, based on Conditional Random Field (CRF) model, with local features and global features, the character-based tagging model is designed. Secondly, Hidden Markov Models (HMM) is used to revise the substrings with low marginal probability by CRF. Finally, confidence measure is used...
متن کاملImproving Chinese Word Segmentation with Description Length Gain
Supervised and unsupervised learning has seldom joined with and thus lend strength to each other in the field of Chinese word segmentation (CWS). This paper presents a novel approach to CWS that utilizes description length gain (DLG), an empirical goodness measure for unsupervised word discovery, to enhance the segmentation performance of conditional random field (CRF) learning. Specifically, w...
متن کامل